In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in manageable chunks. The code cells provided in the base template outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise them.
It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import json
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
# magic word for producing visualizations in notebook
%matplotlib inline
'''
Import note: The classroom currently uses sklearn version 0.19.
If you need to use an imputer, it is available in sklearn.preprocessing.Imputer,
instead of sklearn.impute as in newer versions of sklearn.
'''
There are four files associated with this project (not including this one):
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891,221 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of the individual, such as their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those clusters. The hope is that certain clusters are over-represented in the customers data as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.
Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv', sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep=';')
# Check the structure of the data after it's loaded (e.g. print the number of
# rows and columns, print the first few rows).
azdias.head()
| | AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |
5 rows × 85 columns
print(azdias.shape)
(891221, 85)
feat_info.head()
| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |
print(feat_info.shape)
(85, 4)
I can see that the AZDIAS_Feature_Summary file provides basic information about each column of the Udacity_AZDIAS_Subset dataset.
Tip: Add additional cells to keep everything in reasonably-sized chunks! Keyboard shortcuts:
esc --> a (press Escape to enter command mode, then press the 'A' key) adds a new cell before the active cell, and esc --> b adds a new cell after the active cell. If you need to convert an active cell to a markdown cell, use esc --> m, and to convert to a code cell, use esc --> y.
The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
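A minimal sketch of that parsing step (the helper name and the tiny demo frame are my own, not part of the project template; note that codes like 'X' and 'XX' in the CAMEO columns are not integers, so each token is kept as a string when it cannot be parsed as an int):

```python
import numpy as np
import pandas as pd

def parse_missing_codes(code_str):
    """Parse a summary-file entry like '[-1,0]' or '[-1,X]' into a list."""
    codes = []
    for tok in code_str.strip('[]').split(','):
        if not tok:                # '[]' parses to an empty list
            continue
        try:
            codes.append(int(tok))
        except ValueError:
            codes.append(tok)      # letter codes such as 'X', 'XX'
    return codes

# Tiny demo frame standing in for azdias; AGER_TYP's codes are [-1,0].
demo = pd.DataFrame({'AGER_TYP': [-1, 2, 0, 3]})
demo['AGER_TYP'] = demo['AGER_TYP'].replace(parse_missing_codes('[-1,0]'), np.nan)
print(demo['AGER_TYP'].isnull().sum())  # -> 2
```

On the full dataset the same replace can be looped over `zip(feat_info['attribute'], feat_info['missing_or_unknown'])`, one column at a time; one subtlety is that string-typed columns will only match the string form of a code, so type mismatches are worth checking per column.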
As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.
For my own inspection, I will split the columns into two groups:
a) categorical and ordinal columns: for each of them, show all unique values in the form of a **pie chart**
b) continuous-valued columns: for each of them, print a **histogram**

**I know that information about the type of data in each column, and the individual values that represent missing data, is in the Summary file, but I want to make sure and do this analysis myself.**
# Let's get some help from the provided info file. How does it describe the main dataset's columns?
print(feat_info['type'].unique())
['categorical' 'ordinal' 'numeric' 'mixed' 'interval']
I'm not entirely sure what is behind these type labels, so I'll take a sample of each for a preview...
# Looking for columns which are 'categorical'
print(feat_info[feat_info['type'] == 'categorical'][:3], '\n')
# Looking for columns which are 'ordinal'
print(feat_info[feat_info['type'] == 'ordinal'][:3], '\n')
# Looking for columns which are 'numeric'
print(feat_info[feat_info['type'] == 'numeric'][:3], '\n')
# Looking for columns which are 'mixed'
print(feat_info[feat_info['type'] == 'mixed'][:3], '\n')
# Looking for columns which are 'interval'
print(feat_info[feat_info['type'] == 'interval'][:3], '\n')
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
attribute information_level type missing_or_unknown
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
4 FINANZ_MINIMALIST person ordinal [-1]
5 FINANZ_SPARER person ordinal [-1]
attribute information_level type missing_or_unknown
11 GEBURTSJAHR person numeric [0]
44 ANZ_PERSONEN household numeric []
45 ANZ_TITEL household numeric []
attribute information_level type missing_or_unknown
15 LP_LEBENSPHASE_FEIN person mixed [0]
16 LP_LEBENSPHASE_GROB person mixed [0]
22 PRAEGENDE_JUGENDJAHRE person mixed [-1,0]
attribute information_level type missing_or_unknown
43 ALTER_HH household interval [0]
For the column examples found above, I'll print some of the values they contain; outside of this notebook, I'll also inspect the "Data_Dictionary.md" file for detailed information and better understanding.
# I have sample column names for each data category; now let's see what the data looks like in those groups of columns
# data in columns which are 'categorical'
print(azdias['AGER_TYP'][:5], '\n')
print(azdias['ANREDE_KZ'][:5], '\n')
print(azdias['CJT_GESAMTTYP'][:5], '\n')
0   -1
1   -1
2   -1
3    2
4   -1
Name: AGER_TYP, dtype: int64

0    1
1    2
2    2
3    2
4    1
Name: ANREDE_KZ, dtype: int64

0    2.0
1    5.0
2    3.0
3    2.0
4    5.0
Name: CJT_GESAMTTYP, dtype: float64
Here I was fairly confident, because the name "categorical" makes it clear how we should treat this data. In the samples printed above, I see numeric labels in int and float formats. I expected string labels; although these samples are numeric, I can't rule out that some categorical columns contain strings. The "categorical" columns will be one-hot encoded.
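As a sketch of that re-encoding (illustrative values, not taken from the dataset; `pd.get_dummies` is one straightforward way to do it):

```python
import pandas as pd

# Illustrative categorical column with numeric labels, as in CJT_GESAMTTYP.
df = pd.DataFrame({'CJT_GESAMTTYP': [2.0, 5.0, 3.0, 2.0, 5.0]})

# pd.get_dummies creates one binary indicator column per category level.
dummies = pd.get_dummies(df['CJT_GESAMTTYP'], prefix='CJT_GESAMTTYP')
print(dummies.columns.tolist())
# -> ['CJT_GESAMTTYP_2.0', 'CJT_GESAMTTYP_3.0', 'CJT_GESAMTTYP_5.0']
```

Each row then has exactly one indicator set, so no artificial ordering is imposed on the category labels.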
# data in columns which are 'ordinal'
print(azdias['ALTERSKATEGORIE_GROB'][:5], '\n')
print(azdias['FINANZ_MINIMALIST'][:5], '\n')
print(azdias['FINANZ_SPARER'][:5], '\n')
0    2
1    1
2    3
3    4
4    3
Name: ALTERSKATEGORIE_GROB, dtype: int64

0    3
1    1
2    1
3    4
4    4
Name: FINANZ_MINIMALIST, dtype: int64

0    4
1    5
2    4
3    2
4    3
Name: FINANZ_SPARER, dtype: int64
The name "ordinal" also clearly identifies the data in these columns. They are categorical data, but with dependencies and ranks between the different values. Therefore, for such data I will keep the original value numbering (at least during the data-cleansing stage, before scaling).
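To illustrate that plan (a sketch with made-up ranks; `StandardScaler` is one option for the later scaling step, not something prescribed by the template):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative ordinal column (ranks 1-5, as in the FINANZ_* features):
# the original numbering is kept through cleaning, preserving the rank
# order, and is only rescaled just before PCA / clustering.
ranks = np.array([[3], [1], [1], [4], [4]], dtype=float)
scaled = StandardScaler().fit_transform(ranks)
print(scaled.ravel())  # zero-mean, unit-variance, same ordering as before
```

Note that one-hot encoding these columns instead would throw away the rank information, which is why they are kept numeric.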
# data in columns which are 'numeric'
print(azdias['GEBURTSJAHR'][:5], '\n')
print(azdias['ANZ_PERSONEN'][:5], '\n')
print(azdias['ANZ_TITEL'][:5], '\n')
0       0
1    1996
2    1979
3    1957
4    1963
Name: GEBURTSJAHR, dtype: int64

0    NaN
1    2.0
2    1.0
3    0.0
4    4.0
Name: ANZ_PERSONEN, dtype: float64

0    NaN
1    0.0
2    0.0
3    0.0
4    0.0
Name: ANZ_TITEL, dtype: float64
"Numeric" columns are a combination of continuous values and discrete but strictly numerical values such as years (1974, 1999, etc.). To avoid over-complicating the work, I will not split this set into truly continuous and ordinal data, but will treat all of them as continuous data. It follows from this decision that when I further inspect the contents of these columns, I will not print or chart their unique values. To find missing or incorrect values, I will only look for anything that is not a number, or for the missing codes (such as "-1" or "0") where the sources indicate them.
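A small sketch of that check, distinguishing naturally missing values (already NaN) from coded-missing values (the series below is illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative series standing in for GEBURTSJAHR, where the summary
# file lists 0 as the 'missing' code for this numeric column.
years = pd.Series([0, 1996, 1979, 0, 1963, np.nan])

naturally_missing = int(years.isnull().sum())  # NaN already present in the data
coded_missing = int((years == 0).sum())        # rows using the 0 'missing' code
print(naturally_missing, coded_missing)  # -> 1 2
```

The same two counts can be computed per column to see how much "hidden" missingness each numeric feature carries.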
# data in columns which are 'mixed'
print(azdias['LP_LEBENSPHASE_FEIN'][:5], '\n')
print(azdias['LP_LEBENSPHASE_GROB'][:5], '\n')
print(azdias['PRAEGENDE_JUGENDJAHRE'][:5], '\n')
0    15.0
1    21.0
2     3.0
3     0.0
4    32.0
Name: LP_LEBENSPHASE_FEIN, dtype: float64

0     4.0
1     6.0
2     1.0
3     0.0
4    10.0
Name: LP_LEBENSPHASE_GROB, dtype: float64

0     0
1    14
2    15
3     8
4     8
Name: PRAEGENDE_JUGENDJAHRE, dtype: int64
I will most likely treat the data in "mixed" columns as categorical, but I'll confirm that after a visual inspection of the values in those columns.
# data in columns which are 'interval'
print(azdias['ALTER_HH'][:5], '\n')
0     NaN
1     0.0
2    17.0
3    13.0
4    20.0
Name: ALTER_HH, dtype: float64
There is only one column with 'interval' data, and I've learned from the "Data_Dictionary" file that these values are discrete numeric codes for birthdate intervals. I'll treat that column as "ordinal" data.
So, as I wrote above, I would like to split the dataset columns into two groups:
a) "categorical" and "ordinal" columns
b) and columns with continuous numeric values
Based on the review above and in line with the categories from AZDIAS_Feature_Summary.csv:
a) == 'categorical', 'ordinal', 'mixed', 'interval'
b) == 'numeric'
# Visual identification of missing / unknown / aberrant values (and of all values in general... :) )
# I only display columns from the "a)" group, in the form of pie charts
# making list of names of "a)" group columns
list_of_categorical_cols = feat_info[feat_info['type']!='numeric']['attribute']
print(list_of_categorical_cols, '\n')
for col_name in list_of_categorical_cols:
    fig = px.pie(azdias[col_name].value_counts(dropna=False), values=col_name,
                 names=azdias[col_name].value_counts(dropna=False).index,
                 title=col_name, template='ggplot2')
    fig.show()
0 AGER_TYP
1 ALTERSKATEGORIE_GROB
2 ANREDE_KZ
3 CJT_GESAMTTYP
4 FINANZ_MINIMALIST
...
80 PLZ8_HHZ
81 PLZ8_GBZ
82 ARBEIT
83 ORTSGR_KLS9
84 RELAT_AB
Name: attribute, Length: 78, dtype: object
Now I will print the missing values given in the Summary file for each column.
# Print the full feature summary (no row limit) to see each column's missing codes.
with pd.option_context('display.max_rows', None):
    print(feat_info)
        attribute              information_level  type         missing_or_unknown
    0   AGER_TYP               person             categorical  [-1,0]
    1   ALTERSKATEGORIE_GROB   person             ordinal      [-1,0,9]
    2   ANREDE_KZ              person             categorical  [-1,0]
    3   CJT_GESAMTTYP          person             categorical  [0]
    4   FINANZ_MINIMALIST      person             ordinal      [-1]
    5   FINANZ_SPARER          person             ordinal      [-1]
    6   FINANZ_VORSORGER       person             ordinal      [-1]
    7   FINANZ_ANLEGER         person             ordinal      [-1]
    8   FINANZ_UNAUFFAELLIGER  person             ordinal      [-1]
    9   FINANZ_HAUSBAUER       person             ordinal      [-1]
    10  FINANZTYP              person             categorical  [-1]
    11  GEBURTSJAHR            person             numeric      [0]
    12  GFK_URLAUBERTYP        person             categorical  []
    13  GREEN_AVANTGARDE       person             categorical  []
    14  HEALTH_TYP             person             ordinal      [-1,0]
    15  LP_LEBENSPHASE_FEIN    person             mixed        [0]
    16  LP_LEBENSPHASE_GROB    person             mixed        [0]
    17  LP_FAMILIE_FEIN        person             categorical  [0]
    18  LP_FAMILIE_GROB        person             categorical  [0]
    19  LP_STATUS_FEIN         person             categorical  [0]
    20  LP_STATUS_GROB         person             categorical  [0]
    21  NATIONALITAET_KZ       person             categorical  [-1,0]
    22  PRAEGENDE_JUGENDJAHRE  person             mixed        [-1,0]
    23  RETOURTYP_BK_S         person             ordinal      [0]
    24  SEMIO_SOZ              person             ordinal      [-1,9]
    25  SEMIO_FAM              person             ordinal      [-1,9]
    26  SEMIO_REL              person             ordinal      [-1,9]
    27  SEMIO_MAT              person             ordinal      [-1,9]
    28  SEMIO_VERT             person             ordinal      [-1,9]
    29  SEMIO_LUST             person             ordinal      [-1,9]
    30  SEMIO_ERL              person             ordinal      [-1,9]
    31  SEMIO_KULT             person             ordinal      [-1,9]
    32  SEMIO_RAT              person             ordinal      [-1,9]
    33  SEMIO_KRIT             person             ordinal      [-1,9]
    34  SEMIO_DOM              person             ordinal      [-1,9]
    35  SEMIO_KAEM             person             ordinal      [-1,9]
    36  SEMIO_PFLICHT          person             ordinal      [-1,9]
    37  SEMIO_TRADV            person             ordinal      [-1,9]
    38  SHOPPER_TYP            person             categorical  [-1]
    39  SOHO_KZ                person             categorical  [-1]
    40  TITEL_KZ               person             categorical  [-1,0]
    41  VERS_TYP               person             categorical  [-1]
    42  ZABEOTYP               person             categorical  [-1,9]
    43  ALTER_HH               household          interval     [0]
    44  ANZ_PERSONEN           household          numeric      []
    45  ANZ_TITEL              household          numeric      []
    46  HH_EINKOMMEN_SCORE     household          ordinal      [-1,0]
    47  KK_KUNDENTYP           household          categorical  [-1]
    48  W_KEIT_KIND_HH         household          ordinal      [-1,0]
    49  WOHNDAUER_2008         household          ordinal      [-1,0]
    50  ANZ_HAUSHALTE_AKTIV    building           numeric      [0]
    51  ANZ_HH_TITEL           building           numeric      []
    52  GEBAEUDETYP            building           categorical  [-1,0]
    53  KONSUMNAEHE            building           ordinal      []
    54  MIN_GEBAEUDEJAHR       building           numeric      [0]
    55  OST_WEST_KZ            building           categorical  [-1]
    56  WOHNLAGE               building           mixed        [-1]
    57  CAMEO_DEUG_2015        microcell_rr4      categorical  [-1,X]
    58  CAMEO_DEU_2015         microcell_rr4      categorical  [XX]
    59  CAMEO_INTL_2015        microcell_rr4      mixed        [-1,XX]
    60  KBA05_ANTG1            microcell_rr3      ordinal      [-1]
    61  KBA05_ANTG2            microcell_rr3      ordinal      [-1]
    62  KBA05_ANTG3            microcell_rr3      ordinal      [-1]
    63  KBA05_ANTG4            microcell_rr3      ordinal      [-1]
    64  KBA05_BAUMAX           microcell_rr3      mixed        [-1,0]
    65  KBA05_GBZ              microcell_rr3      ordinal      [-1,0]
    66  BALLRAUM               postcode           ordinal      [-1]
    67  EWDICHTE               postcode           ordinal      [-1]
    68  INNENSTADT             postcode           ordinal      [-1]
    69  GEBAEUDETYP_RASTER     region_rr1         ordinal      []
    70  KKK                    region_rr1         ordinal      [-1,0]
    71  MOBI_REGIO             region_rr1         ordinal      []
    72  ONLINE_AFFINITAET      region_rr1         ordinal      []
    73  REGIOTYP               region_rr1         ordinal      [-1,0]
    74  KBA13_ANZAHL_PKW       macrocell_plz8     numeric      []
    75  PLZ8_ANTG1             macrocell_plz8     ordinal      [-1]
    76  PLZ8_ANTG2             macrocell_plz8     ordinal      [-1]
    77  PLZ8_ANTG3             macrocell_plz8     ordinal      [-1]
    78  PLZ8_ANTG4             macrocell_plz8     ordinal      [-1]
    79  PLZ8_BAUMAX            macrocell_plz8     mixed        [-1,0]
    80  PLZ8_HHZ               macrocell_plz8     ordinal      [-1]
    81  PLZ8_GBZ               macrocell_plz8     ordinal      [-1]
    82  ARBEIT                 community          ordinal      [-1,9]
    83  ORTSGR_KLS9            community          ordinal      [-1,0]
    84  RELAT_AB               community          ordinal      [-1,9]
Looking at the pie charts, I HAVE NOTICED missing or incorrect values other than those marked in the Summary file! For example: for columns KBA05_ANTG1 through KBA05_ANTG4 there should be a "-1" value for missing/unknown, as stated in the summary above. Yet the pie charts for those columns show no such value; instead, a "null" value exists. In this particular case it's not a problem, because I'll overwrite every missing/unknown value with NaN anyway. Still, caution never hurts, and I feel more confident after doing my own analysis.
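This kind of discrepancy can also be checked programmatically. A minimal sketch (the series below is illustrative, standing in for a real column):

```python
import numpy as np
import pandas as pd

# Illustrative column standing in for KBA05_ANTG1: the summary declares
# -1 as its missing code, but the raw column may hold NaN instead.
col = pd.Series([np.nan, 2.0, 1.0, np.nan, 3.0])
declared_codes = [-1]

# Which declared codes actually occur, and how many values are already NaN?
present = [c for c in declared_codes if (col == c).any()]
print(present, int(col.isnull().sum()))  # -> [] 2
```

Looping this over all columns gives a quick audit of where the summary file and the raw data disagree about how missingness is encoded.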
# For the "b)" group (numeric columns) I'll show histograms.
# making list of names of "b)" group columns
list_of_numeric_cols = feat_info[feat_info['type']=='numeric']['attribute']
print(list_of_numeric_cols, '\n')
for col_name in list_of_numeric_cols:
    fig = px.histogram(azdias, x=col_name)
    fig.show()
11            GEBURTSJAHR
44           ANZ_PERSONEN
45              ANZ_TITEL
50    ANZ_HAUSHALTE_AKTIV
51           ANZ_HH_TITEL
54       MIN_GEBAEUDEJAHR
74       KBA13_ANZAHL_PKW
Name: attribute, dtype: object